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Abstract This article presents new algorithms for mas¬ 
sively parallel granular dynamics simulations on dis¬ 
tributed memory architectures using a domain parti¬ 
tioning approach. Collisions are modelled with hard 
contacts in order to hide their micro-dynamics and thus 
to extend the time and length scales that can be simu¬ 
lated. The multi-contact problem is solved using a non¬ 
linear block Gauss-Seidel method that is conforming to 
the subdomain structure. The parallel algorithms em¬ 
ploy a sophisticated protocol between processors that 
delegate algorithmic tasks such as contact treatment 
and position integration uniquely and robustly to the 
processors. Communication overhead is minimized through 
aggressive message aggregation, leading to excellent strong 
and weak scaling. The robustness and scalability is as¬ 
sessed on three clusters including two peta-scale super¬ 
computers with up to 458 752 processor cores. The sim¬ 
ulations can reach unprecedented resolution of up to ten 
billion (10 1Q ) non-spherical particles and contacts. 
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1 Introduction 

Granular matter exhibits intriguing behaviours akin to 
solids, liquids or gases. However, in contrast to those 
fundamental states of matter, granular matter still can¬ 
not be described by a unified model equation homoge¬ 
nizing the dynamics of the individual particles [25] . To 
date, the rich set of phenomena observed in granular 
matter, can only be reproduced with simulations that 
resolve every individual particle. In this paper, we will 
consider methods where also the spatial extent and ge¬ 
ometric shape of the particles can be modelled. Thus 
in addition to position and translational velocity the 
orientation and angular velocity of each particle consti¬ 
tute the state variables of the dynamical system. The 
shapes of the particles can be described for example 
by geometric primitives, such as spheres or cylinders, 
with a low-dimensional parameterization. Composite 
objects can be introduced as a set of primitives that 
are rigidly glued together.Eventually, even meshes with 
a higher-dimensional parameterization can be used. In 
this article the shape of the particles does not change 
in time, i.e. no agglomeration, fracture or deformation 
takes place. The rates of change of the state variables 
are described by the Newton-Euler equations, and the 
particle interactions are determined by contact models. 

Two fundamentally different model types must be 
distinguished: Soft and hard contacts. Soft contacts al¬ 
low a local compliance in the contact region, whereas 
hard contacts forbid penetrations. In the former class 
the contact forces can be discontinuous in time, lead¬ 
ing to non-differentiable but continuous velocities after 
integration. The differential system can be cast e.g. as 
an ordinary differential equation with a discontinuous 
right-hand side or as differential inclusions. However, 
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the resulting differential system is typically extremely 
stiff if realistic material parameters are employed. 

In the latter class, discontinuous forces are not suffi¬ 
cient to accomplish non-penetration of the particles. In¬ 
stead, impulses are necessary to instantaneously change 
velocities on collisions or in self-locking configurations 
if Coulomb friction is present [33J . Stronger mathemat¬ 
ical concepts are required to describe the dynamics. For 
that purpose, Moreau introduced the measure differen¬ 
tial inclusions in m- 

Hard contacts are an idealization of reality. The 
rigidity of contacts has the advantage that the dynam¬ 
ics of the micro-collisions does not have to be resolved 
in time. However, this also introduces ambiguities: The 
rigidity has the effect that the force chains along which 
a particle is supported are no longer unique [25] . If en¬ 
ergy is dissipated, this also effects the dynamics. To 
integrate measure differential inclusions numerically in 
time, two options exist: In the first approach the inte¬ 
gration is performed in subintervals from one impulsive 
event to the next [251 G3]- At each event an instan¬ 
taneous impact problem must be solved whose solu¬ 
tion serves as initial condition of the subsequent inte¬ 
gration subinterval. Impact problems can range from 
simple binary collisions, to self-locking configurations, 
to complicated instantaneous frictional multi-contact 
problems with simultaneous impacts. The dynamics be¬ 
tween events are described by differential inclusions, 
differential algebraic equations or ordinary differential 
equations. Predicting the times of the upcoming events 
correctly is non-trivial in general and handling them in 
order in parallel is impeding the scalability [25] . In the 
second approach no efforts are made to detect events, 
but the contact conditions are only required to be sat¬ 
isfied at discrete points in time. This approach is com¬ 
monly referred to as a time-stepping method. 

This article focuses on the treatment of hard con¬ 
tacts in order to avoid the temporal resolution of micro¬ 
collisions and thus the dependence of the time-step length 
on the stiffness of the contacts. In order to avoid the res¬ 
olution of events a time-stepping method is employed. 
This considerably pushes the time scales accessible to 
granular flow simulations for stiff contacts. 

To estimate the order of a typical real-life problem 
size of a granular system, consider an excavator bucket 
with a capacity of 1 m 3 . Assuming sand grains with 
a diameter of 0.15 mm, and assuming that they are 
packed with a solid volume fraction of 0.6, the exca¬ 
vator bucket contains in the order of 10 10 particles. In 
such a dense packing the number of contacts is in the 
same order as the number of particles. Only large scale 
parallel systems with distributed memory can provide 
enough memory to store the data and provide sufficient 


computational power to integrate such systems for a 
relevant simulation time. Consequently a massive par¬ 
allelization of the numerical method for architectures 
with distributed memory is absolutely essential. 


In the last half decade several approaches were pub¬ 
lished suggesting parallelizations of the methods inte¬ 
grating the equations of motion of rigid particles in hard 
contact [38l[39l[2Ql|34l[T5l[T6l[28]. The approach put for¬ 
ward in this article builds conceptually on these previ¬ 
ous approaches but exceeds them substantially by con¬ 
sistently parallelizing all parts of the code, consistently 
distributing all simulation data (including the descrip¬ 
tion of the domain partitioning), systematically mini¬ 
mizing the volume of communication and the number 
of exchanged messages, and relying exclusively on ef¬ 
ficient nearest-neighbor communication. The approach 
described here additionally spares the expensive assem¬ 
bly of system matrices by employing matrix-free com¬ 
putations. All this is accomplished without sacrificing 
accuracy. The matrix-freeness allows the direct and straight 
forward evaluation of wrenches in parallel and thus re¬ 
duces the amount of communicated data. Furthermore, 
an exceptionally robust synchronization protocol is de¬ 
fined, which is not susceptible to numerical errors. The 
excellent parallel scaling behaviour is then demonstrated 
for dilute and dense test problems in strong- and weak- 
scaling experiments on three clusters with fundamen¬ 
tally different interconnect networks. Among the test 
machines are the peta-scale supercomputers SuperMUC 
and Juqueen. The results show that given a sufficient 
computational intensity of the granular setup and an 
adequate interconnect, few hundred particles per pro¬ 
cess are enough to obtain satisfactory scaling even on 
millions of processes. 


In Sect. [2] of this paper the underlying differential 
equations and the time-continuous formulation of the 
hard contact models are formulated. Sect. [3] proposes 
a discretization scheme and discrete constraints for the 
hard contact model. The problem of reducing the num¬ 
ber of contacts in the system for efficiency reasons is 
addressed in Sect. [4j Subsequently, an improved nu¬ 
merical method for solving multi-contact problems in 
parallel is introduced in Sect. [5] before turning to the 
design of the parallelization in Sect. [6j The scalability 
of the parallelization is then demonstrated in Sect. [7] by 
means of dilute and dense setups on three different clus¬ 
ters. Finally, the algorithms and results are compared 
to previous work by other authors in Sect.[8]before sum¬ 
marizing in Sect. [9] 
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2 Continuous Dynamical System 


The Newton-Euler equations for a system with par¬ 
ticles are [22] 



( v (t) 

r(s{t),t) - u(t) x I(<p(t))ui(t) 


where the positions x(t ) £ R 3 " 6 , the rotations <p(t) € 
R 4 " b , translational velocities v(t) £ R 3 " b , and angular 
velocities u>(t) £ R 3 " b are the state variables at time t. 

Different parameterizations exist for the rotations, 
but quaternions having four real components are the 
parameterization of choice here. Independent of the pa¬ 
rameterization, the derivatives of the rotation compo¬ 
nents can be expressed in terms of a matrix-vector prod¬ 
uct between a block-diagonal matrix and the angular 
velocities m- If the rotation of particle i is described 
by the quaternion q w + q x i + q y j + q z k £ H then, ac¬ 
cording to uni, the i-tli diagonal block of Q is 


Q iiiViit)) — _ 


-Qx ~Qy -Qz 
Qw Qz -Qy 
Qz Qw Qx 
Qy Qx Qw . 


Each particle has an associated body frame whose 
origin coincides with the body’s center of mass and 
whose axes are initially aligned with the axes of the ob¬ 
servational frame. The body frame is rigidly attached 
to the body and translates and rotates with it. All of 
the state variables and other quantities are expressed 
in the observational frame unless noted otherwise. Fur¬ 
thermore, the matrix 


MM*)) 


diag mil 

*= l-Vb 

diag I u(<fii(t)) 

i=l..Vb 


is the block-diagonal mass matrix, where 1 denotes the 
3x3 identity matrix. The mass matrix contains the 
constant particle masses rrii and the particles’ inertia 
matrices I u{ipi{t)) about the particles’ centers of mass. 
The latter can be calculated by similarity transforma¬ 
tions from the constant body frame inertia matrices I 3 . 
If the body frames are attached such that they coincide 
with the principal axes of their particles, then the body 
frame inertia matrices are diagonal, and floating-point 
operations as well as memory can be saved. The lower- 
right block of the mass matrix corresponds to the ma¬ 
trix I(<p(£)). f(s(t),t) and T(s(t),t) are the total forces 
and torques (together they are referred to as wrenches) 
acting at the particles’ centers of mass. Both may de¬ 
pend on any of the state variables s(t) of the system 


and time t. The wrench contributions from contact re¬ 
actions are summed up with external forces f ex t and 
torques T ext such as fictitious forces from non-inertial 
reference frames. 

Let A j(t) £ R 3 be the contact reaction of a con¬ 
tact j £ C, where C = {1 .. u c } is the set of poten¬ 
tial contact indices. Let (j \, j 2 ) £ B 2 be the index pair 
of both particles involved in the contact j, where B = 
{1 .. Ub} is the set of body indices. Let Xj(x(t), <p(t)) £ 
R 3 be the location of contact j, then the wrench on 
body i is 




fi,ext{s(t),t ) 


+ £ 


-E 

i€C 


(%(x(4¥>(*)) 


Xj(t) 

AW, 


wrench contributions 


(1) 


where (•) x is a matrix, which when multiplied to a 
vector corresponds to the cross product between its 
operand (•) and the vector. 

In contrast to soft contact models, the contact re¬ 
actions in hard contact models cannot be explicitly ex¬ 
pressed as a function of the state variables but are de¬ 
fined implicitly, e.g. by implicit non-linear functions [21] , 
complementarity constraints mi, or inclusions [36] . 

In any case, the constraints distinguish between reac¬ 
tions in the directions normal to the contact surfaces 
and reactions in the tangential planes of the contact 
surfaces. The former are used to formulate the non¬ 
penetration constraints, and the latter are used to for¬ 
mulate the friction constraints. For that reason, each 
contact j is associated with a contact frame, where 
the axis rij(x(t), ip(t)) £ R 3 points along the direc¬ 
tion normal to the contact surface, and the other two 
axes tj(x(t),<p(t)) £ R 3 and oj(x(t),ip(t)) £ R 3 span 
the tangential plane of the contact. 

Let S, be the set of points in the observational frame 
defining the shape of particle i, and let fi(xi(t), ipi(t),y) £ 
R be the associated signed distance function for a point 
y in the observational frame. The signed distance func¬ 
tion shall be negative in the interior of Si. Assum¬ 
ing that all particles are (strictly) convex with suffi¬ 
ciently smooth boundaries, then for a pair of particles 
(ji, j' 2 ) involved in a contact j, the contact location 
Xj(x(t),cp(t)) is defined by the optimization problem 


Xj(t) := Xj{x{t),^(t)) 

arg min f h (x h (t), <p h (t), y), ( 2 ) 

Si 2 ( X J2 (*)»V=J 2 (*)>?/)< 0 


with associated contact normal 

rij{t) := nj(x(t),ip(t)) = V y f h (x h (t), cp j 2 (t), 
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pointing outwards with respect to Sj 2 and associated 
signed contact distance 

&(*) :=&(*(*)>¥>(*)) = (*),¥>* (*)>%(*)) 

which is negative in the case of penetrations. 

For convex particles each pair of bodies results in 
a potential contact, and thus the total number of con¬ 
tacts v c is limited by — 1)- Non-convex objects 

e.g. can be implemented as composite objects of con¬ 
vex particles. By convention a positive reaction in nor¬ 
mal direction is repulsive, and thus the contact reaction 
A j (t) acts positively on particle j\ and negatively on j 2 , 
thus explaining the signs in 0 - By applying the oppo¬ 
site reactions at the same point in the observational 
frame, not only the linear momentum can be conserved 
but also the angular momentum of the system. Con¬ 
servation of energy can only hold if the contact model 
does not include dissipative effects. Hard-contact mod¬ 
els require the Signorini condition to hold. Written as 
a complementarity condition for a contact j, it reads 

&(t) > 0 ± A jin (t) > 0, 

where A j, n (t) = rij(t) T The signed contact dis¬ 
tance is required to be non-negative, resulting in a non¬ 
penetration constraint. The contact reaction in direc¬ 
tion of the contact normal is also required to be non¬ 
negative, resulting in non-adhesive contact reactions. 
Furthermore, both quantities must be complementary, 
meaning that either of them must be equal to zero. This 
effects that the contact reaction can only be non-zero 
if the contact is closed. 

However, the Signorini condition does not determine 
the contact reaction force if the contact is closed. In 
that case the non-penetration constraint on the velocity 
level, 

i+(t) >01 A j<n (t) > 0 , 

must be added to the system, where is the right 
derivative of the signed contact distance with respect 
to time. The constraint allows the contact to break 
only if no reaction force is present and otherwise forces 
(t) = 0. In the latter case the reaction force is still 
not fixed. The non-penetration constraint on the accel¬ 
eration level, 

|+(f) > 0 1 A i>n (f) > 0, 

then determines the force also if ^ (t) = 0. When con¬ 
sidering impacts, a non-penetration constraint for the 
reaction impulse in the direction normal to the contact 
surface must be formulated, and, if the contact is closed, 
an additional constraint modelling an impact law such 
as Newton’s impact must be added. 


These non-penetration conditions can be comple¬ 
mented by a friction condition. The most prominent 
model for dry frictional contact is the Coulomb model 
which restricts the relative contact velocity in the tan¬ 
gential plane of the contact. The relative contact veloc¬ 
ity for a pair of particles (ji, J 2 ) involved in a contact j 
is 

5vf(s(t)) =vf(t)+uf(t) x (xj{x(i),<p(t)) - x h {t)) 

x (Xj(x(i),(p(t)) - X h (t)). 
Let 


to (t) ■■= 6v+ to (s(t)) 


\Oj{x{t),<p{t)) T Svf{s(t))J 


be the relative contact velocity in the tangential plane 
after application of the contact impulses, then the Cou¬ 
lomb conditions for a non-impulsive point in time t are 


IIAj, to (t )||2 < and 

However, if ||duA to (t )|| 2 = 0 these conditions must be 
supplemented by the constraint 

1 1 ^^j,to (^) 1 1 2 (t) = — 

on acceleration level in order to determine the friction 
force. Likewise constraints for the friction impulse are 
necessary. At this point we refrain from formulating the 
measure differential inclusion in detail since it would 
not contribute information essential to the remaining 
paper which only deals with the discrete-time system. 


3 Discrete Dynamical System 

In simulations of granular matter impulsive reactions 
are abundant. Higher-order integrators for time-stepping 
schemes are still subject to active research [5T|. In par¬ 
ticular, discontinuities pose problems for these integra¬ 
tors. Hence, the continuous dynamical system is dis¬ 
cretized in the following with an integrator of order one, 
resembling the semi-implicit Euler method and similar 
to the one suggested in [ 3 . 

Let s , x , ip, v and denote the given discrete-time 
state variables at time t and A the contact reactions at 
time t. Then the state variables at time t + St are func¬ 
tions depending on the contact reactions: s'(A), x'(X), 
v'(\) and u/(A). The discrete-time Newton- 
Euler equations integrated by the proposed scheme are 


_ fx\ 

V '(\)J - 

- f v \ 
«'(A )J W 


+ 5t 


( v'(\) \ 

{qMu/wJ ’ 




yr(s, A, t) — uj x I (ip)u) 


( 3 ) 
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Positions and orientations at time t + St appear exclu¬ 
sively on the left-hand side of the position and orien¬ 
tation integration. Velocities at time t + St appear on 
the left-hand side of the velocity integration and addi¬ 
tionally in the integration of positions and orientations. 
The numerical integration of the quaternion has the ef¬ 
fect that the quaternion gradually looses its unit length. 
This deficiency can be compensated by renormalizing 
the quaternions after each integration. 

Instead of discretizing each of the five intermittently 
active continuous-time complementarity constraints, the 
Signorini condition is only required to hold at the end 
of each time step. This has the effect that impulsive re¬ 
actions are no longer necessary to satisfy the condition 
since the condition is no longer required to be fulfilled 
instantaneously. Furthermore, the signed distance func¬ 
tion gets linearized, resulting in 

£j(i + St) = £j(t) + St£j(t) + 0(St 2 ), 

where the time derivative of the signed contact distance 
can be determined to be 

ij(t) = nj(t) T Svf(s(t)) 

under the assumption that the contact point Xj (t) trans¬ 
lates and rotates in accordance with body j' 2 , such that 

Xj(t) = Vj 2 {t) + U>j 2 (t) X (Xj{t) - x h (t)). 

Let the time-discrete relative contact velocity be 

Sv'j{\) =v' h (\) + u4(A) x (xj - x h ) 

-u' 2 (A) - w' 2 (A) x (xj - x j2 ), 

where the velocities are discretized implicitly. The dis¬ 
crete non-penetration constraint then is 

jj_ + nJSv'j(X) >01 A j tn > 0 . (4) 

The term acts as an error correction term if pen¬ 
etrations are present (fjj < 0). In that case it can be 
scaled down to avoid introducing an excessive amount 
of energy. If no numerical error is present, the contact 
is inelastic. The frictional constraints translate into 

11 Aj ; t 0 1 | 2 < Ab'Ajjri and 

II & v j, to (A) II2 Ay to = ~t i j^j,nSVj to (X). 

Let Fj( A) = 0 denote a non-linear system of equa¬ 
tions equivalent to the constraints from and © of a 
single contact j, and let F( A) denote the collection of 
all Fj( A). Neither F( A) = 0 nor Fj( A) = 0 for given 
Xj have unique solutions. Let Aj) be a possible 

solution of the one-contact problem of contact j, given 
the contact reactions A -■ of all other contacts j. 


A detailed discussion of solution algorithms for one- 
contact problems is out of the scope of this article. How¬ 
ever, splitting methods, where non-penetration and fric¬ 
tion constraints are solved separately, are prone to slow 
convergence or cycling. In [7] Bonnefon et al. solve the 
one-contact problem by finding the root of a quartic 
polynomial. Numerous other approaches exist for mod¬ 
ified friction laws, notably those where the friction cone 
is approximated by a polyhedral cone and solution al¬ 
gorithms for linear complementarity problems can be 
used in [so]. In any case the algorithm of choice should 
be extremely robust in order to successfully resolve v c 
contacts per iteration and time step, where zz c can be 
in the order of 10 ln in this article. 

4 Contact Detection 

The contact problem F{ A) = 0 has 0{v 2 ) non-linear 
equations. Thus, already the setup of the contact prob¬ 
lem would not run in linear time, much less the so¬ 
lution algorithm even if it were optimal. The contact 
constraints of a contact j can be removed from the sys¬ 
tem without altering the result if the contact is known 
to stay open (Aj = 0 ) within the current time step. Let 

Si(t) = {y G R 3 | fi(xi(t),tpi(t),y) < 0} 

be the set of points in space corresponding to the ro¬ 
tated and translated shape of particle i at time t and 
let 

?L:(f) = S 2 {t) + {y <E R 3 | |y|| 2 < hi(t)} 

be an intersection hull that spherically expands the par¬ 
ticle shape by the radius hi{t) > 0. If hi(t) is chosen 
large enough then an algorithm finding intersections be¬ 
tween the hulls can detect all contacts that can poten¬ 
tially become active in the current time step. A possible 
choice for the expansion radius is 

K(t) = (V(||?u(f)||2 + \\<jJi{t)\\ 2 fi) +r, (6) 

where rq = max y6 5 .( 0 )||y ||2 is the bounding radius of 
particle i, and r is a safety margin. The safety mar¬ 
gin becomes necessary since an explicit Euler step is 
underlying the derivation of ©>• In practice, the us¬ 
age of intersection hulls reduces the number of contacts 
considerably. E.g., monodisperse spherical particles can 
have at most 12 contacts per particle if the expansion 
radii are small enough j32|, resulting in Oivb) potential 
contacts. 

Broad-phase contact detection algorithms aim to 
find as few as possible candidate particle pairs for con¬ 
tacts by using e.g. spatial partitioning approaches or ex¬ 
ploiting temporal coherence of the particle positions [9]. 
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1: k i- 0 
2: A< fe ) <- 0 

3: while convergence criterion not met do 
4: for j «— 1 to v c do 

5: for l £ C do 

fi xtfc.U , j X i k+1) if l < j A s c (Z) = a c (J) 

6 ' Ai else 

7: end for 

8: V^F~ 1 ( 0,X<* lJ) ) 

9: Af +1) <- wy + (1 - w)Af } 

10: end for 

11: fc <- fc + 1 

12: end while 

Algorithm 1: The subdomain NBGS method with re¬ 
laxation parameter ui. 


The candidate pairs are then checked in detail in the 
narrow-phase contact detection, where © is solved for 
each pair, leading to the contact location Xj, normal rij 
and signed distance £ 7 - for a contact j. 

To solve ([2]) for non-overlapping particles, the Gil¬ 
bert-Johnson—Keerthi (GJK) algorithm can be used [IB, 

0]. For overlapping particle shapes the expanding poly¬ 
tope algorithm (EPA) computes approximate solutions [5] . 
For simple geometric primitives like spheres, the opti¬ 
mization problem can be solved analytically. The in¬ 
dices of all contacts found that way form the set of po¬ 
tential contacts C = {1 .. v c } at time t. Let F( A) = 0 
from now on denote the contact problem where all con¬ 
tact conditions and contact reactions whose indices are 
not part of C have been filtered out. 


5 Numerical Solution Algorithms 

To solve the multi-contact problem, when suitable so¬ 
lution algorithms for the one-contact problems F~ x 
are given, a non-linear block Gauss-Seidel (NBGS) can 
be used as propagated by the non-smooth contact dy¬ 
namics (NSCD) method [IS] . Unfortunately, the Gauss- 
Seidel algorithm cannot be efficiently executed in par¬ 
allel for irregular data dependencies as they appear in 
contact problems [20; . 

As an alternative, a more general variant is pro¬ 
posed here, accommodating the subdomain structure 
that will arise in the domain partitioning. Therefore, 
each contact j £ C is associated with a subdomain 
number s c (j) £ V , where V = {1 .. v p } is the set of 
subdomain indices for v p subdomains. Alg. [I] presents 
pseudo-code for the subdomain NBGS with the relax¬ 
ation parameter w > 0. The initial solution is chosen to 
be zero, however, any other initialization can be used, 
in particular contact reactions from the previous time 
step. 


The algorithm is of iterative nature and needs an 
appropriate stopping criterion to terminate. In each it¬ 
eration k a sweep over all contacts is performed, where 
each contact j is relaxed, given an approximation of all 
other contact reactions A^d) j n the subdomain NBGS, 
the approximation of contact reaction l is taken from 
the current iteration if it was already relaxed (l < j) 
and if it is associated with the same subdomain as the 
contact j to be relaxed (s c (0 = s c (j)). In all other 
cases, the approximation is taken from the previous it- 
eration. The contact reaction A) is then a weighted 
mean between the previous approximation and the re¬ 
laxation result. If all contacts are associated with the 
same subdomain and w = 1 then Alg. [I] corresponds to 
a classic NBGS. If each contact is associated to a differ¬ 
ent subdomain then Alg. [T] corresponds to a non-linear 
block Jacobi (NBJ) with relaxation parameter u. 


6 Parallelization Design 


Sect. |6.1] introduces the domain partitioning approach. 
Sect. |6.2| then discusses requirements that must be met 
in order to be able to treat all contacts exactly once in 
parallel. Sect. |6.3| explains how accumulator and cor¬ 
rection variables can be used in order to reduce data 


dependencies to other processes. In Sect. 6.4 conditions 


are discussed under which the set of communication 
partners can be reduced to the nearest neighbors. Time- 
integration and the subsequent necessity of synchro¬ 
nization are addressed in Sect. |6.5| before summarizing 
the time-stepping procedure in Sect. |6.6| 


6.1 Domain Partitioning 

Under the assumption that no contacts are present, 
there exists no coupling between the data of any two 
particles, and the problem becomes embarrassingly par¬ 
allel: Each process integrates I I or [^1 particles. 
Let Sf,(i) £ V determine the process responsible for the 
time-integration of particle i as of now referred to as the 
parent process. All data associated with this particle, 
that is the state variables (position, orientation, veloc¬ 
ities) and constants (mass, body frame inertia matrix, 
shape parameters), are instantiated only at the parent 
process in order to distribute the total memory load. 
However, contacts or short-range potentials introduce 
data dependencies to particles that in general are not 
instantiated on the local process nor on a process close 
to the local one, rendering a proper scaling impossible. 
A domain partitioning approach alleviates this prob¬ 
lem. 
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Let ft denote the computational domain within which 
all particles are located and fi p C 12, p £ "P, a family 
of disjoint subdomains into which the domain shall be 
partitioned. In this connection subdomain boundaries 
are associated to exactly one process. One process shall 
be executed per subdomain. The number of processes 
can e.g. correspond to the number of compute nodes in 
a hybrid parallelization or to the total number of cores 
or even threads in a homogeneous parallelization. In the 
domain partitioning approach the integration of a par¬ 
ticle whose center of mass Xi is located in a subdomain 
fi p at time t is calculated by process p. That way data 
dependencies typically pertain the local or neighboring 
subdomains since they are considered to be of short 
range. Let Sb(i) be adapted accordingly. Special care 
is required when associating a particle to a subdomain 
whose center of mass is located on or near subdomain 
interfaces. Especially, periodic boundary conditions can 
complicate the association process since the finite pre¬ 
cision of floating-point arithmetics does in general not 
allow a consistent parametric description of subdomains 
across periodic boundaries. Sect. |6.5| explains how the 
synchronization protocol can be used to allow a reliable 
association. 

The domain partitioning should be chosen such that 
an equal number of particles is located initially in each 
subdomain and sustained over the course of the simula¬ 
tion in order to balance the computational load which 
is directly proportional to the number of particles. Par¬ 
ticles now migrate between processes if their positions 
change the subdomain. Migration can lead to severe 
load imbalances that may need to be addressed by dy¬ 
namically repartitioning the domain. Such load-balancing 
techniques are beyond the scope of this article. 


6.2 Shadow Copies 

A pure local instantiation of particles has the effect that 
contacts cannot be detected between particles that are 
not located on the same process. A process can detect a 
contact if both particles involved in the contact are in¬ 
stantiated on that process. In order to guarantee that 
at least one process can detect a contact, the condi¬ 
tion that a contact j must be detected by all processes 
whose subdomains intersect with the hull intersection 
U h G ft y 2 is sufficient if the intersection of the hull in¬ 
tersection and the domain is non-empty. This condition 
can be fulfilled by the following requirement: 

Requirement 1 A particle i must be instantiated not 
only on the parent process but also on all processes 
whose subdomains intersect with the particle ’s hull. 


These additional instantiations shall be termed shadow 
copies in the following. They must be kept in synchro¬ 
nization with the original instantiation on the parent 
process. In order to agree upon the detecting process 
responsible for treating the contact without commu¬ 
nication a rule is needed. Here, the statement that a 
process is responsible for treating a contact refers to 
the responsibility of the process for executing the re¬ 
laxation of the respective contact in Alg. [l] The typical 
choice for this rule requires that the process whose sub- 
domain contains the point of contact is put in charge 
to treat the contact [Mj . 

However, this rule only works if the process whose 
subdomain contains the point is able to detect the con¬ 
tact. This is only guaranteed if the point of contact is 
located within the hull intersection. Also, if the point 
of contact is located outside of the domain 12, then no 
process will treat it. 

A more intricate drawback of this approach is that 
it can fail in case of periodic boundary conditions: If 
the contact point is located near the periodic bound¬ 
ary, the periodic image of the contact point will be 
detected at the other end of the simulation box. Due 
to the shifted position of the contact point image and 
the limited numerical precision, the subdomains can no 
longer consistently decide the subdomain affinity. 

A more robust rule to determine the subdomain 
affinity can be established by fulfilling the following re¬ 
quirement: 

Requirement 2 All shadow copy holders of a parti¬ 
cle maintain a complete list of all other shadow copy 
holders and the parent process of that particle. 

Then each process detecting a contact can determine 
the list of all processes detecting that very same con¬ 
tact, which is the list of all processes with an instan¬ 
tiation of both particles involved in the contact. This 
list is exactly the same on all processes detecting the 
contact and is not prone to numerical errors. The rule 
can then e.g. appoint the detecting process with small¬ 
est rank to treat the contact. In order to enhance the 
locality of the contact treatment, the rule should fa¬ 
vor the particle parents if they are among the contact 
witnesses. Any such rule defines a partitioning of the 
contact set C. Let C p be the set of all contacts treated 
by process p € V. Then process p instantiates all con¬ 
tacts j £ C p . 

6.3 Accumulator and Correction Variables 

The contact relaxations in Alg. [l] exhibit sums with 
non-local data dependencies. In the following, the re¬ 
dundant evaluation of these sums is prevented by intro- 
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during accumulator variables and the non-local data de¬ 
pendencies are reduced by introducing correction vari¬ 
ables. 

The relaxation of a contact j depends on the data 
of the state variables of both particles (j \, j 2 ) involved 
in the contact, their constants and shape parameters, 
as can be seen by inspecting @.(§ and the definitions 
of the terms appearing therein. All of these quantities 
are instantiated on the detecting process, either as a 
shadow copy or as an original instance. The contact 
variables of contact j (location, signed distance and 
the contact frame) are also required. They are avail¬ 
able on the detecting process since they result from 
the positions, orientations, and the shape parameters 
of the particles (ji, J 2 ) in the contact detection. Fur¬ 
thermore, the force and torque terms from (|T]) acting 
on these particles additionally depend on the locations 
xi and reaction approximations x!f' :>> of all other con¬ 
tacts l involving one of the particles (ji, J 2 ). Neither the 
locations nor the reaction approximations of these con¬ 
tacts are necessarily available on the process treating 
contact j. To rectify this deficiency, one can introduce 
contact shadow copies so that location and reaction ap¬ 
proximation can be mirrored at every instantiation of 
both particles involved in the contact. However, the or¬ 
ganisational overhead of contact shadow copies can be 
circumvented. It is not necessary that the process treat¬ 
ing the contact evaluates all the wrench contributions 
to the particles involved in the contact. Instead, parts 
of the wrench contribution sum can be evaluated on the 
processes actually treating the remote contacts and can 
subsequently be communicated: 



wrench contribution (/i, p Ti )P ) to particle i from process p 


parameter. Since the subdomain NBGS respects the 
subdomain affinity of the contacts, the remote wrench 
contributions to particle i cancel out, and just the total 
wrench on particle i from the last iteration is needed 
in addition to corrections stemming from contacts that 
were already relaxed by the same process. 
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Our implementation instantiates variables on pro¬ 
cess p for the reaction approximations £ R 3 I c pI of 
all contacts treated by process p. Any updates to the 
reaction approximations occur in place. Furthermore, 
an implementation can instantiate accumulator vari¬ 
ables £ M 3 I 8 pI on process p for the wrenches 

from the last iteration of all instantiated particles (shadow 
copies and original instances), where B p contains the in¬ 
dices of all shadow copies and original instances instan¬ 
tiated on process p. This set is partitioned into B p ,i oca i 
and B P}S hadow , containing the indices of the original in¬ 
stances and the shadow copies respectively. 

Instead of evaluating the wrench contribution sums 
each time when calculating the total wrench on parti¬ 
cle i anew, the contributions can be accumulated as the 
contacts are relaxed. For that purpose, implementations 
can instantiate corrections variables 5f^ £ ]R 3 I b pI and 
£ R 3 I b pI. Then, after line [ 9 ] of Alg. [TJ these wrench 
corrections can be updated by assigning 


' =( 3)1 
= ( 3)1 


< 5 /]: 
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The total wrench on particle i can also be expressed 
in terms of the total wrench on particle i at the begin¬ 
ning of iteration k: 


The evaluation of the total wrench on particle i in 
line [8] of Alg. [l] when relaxing contact j in iteration k 
becomes 




When relaxing the contact j in iteration k of the 
subdomain NBGS, the wrench on particle i £ {ji,J 2 } 
is evaluated with the reaction approximation X^’ J ) as 


W- U M,)I J w-w J • 

that is the sum of the accumulator and the correction 
variables. 

At the end of each iteration the wrench corrections 
for each body have to be reduced and added to the accu¬ 
mulated wrench from the last iteration. This can be per¬ 
formed in two message exchanges. In the first message 
exchange each process sends the wrench correction of 
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each shadow copy to its parent process. Then each pro¬ 
cess sums up for each original instance all wrench cor¬ 
rections obtained from the shadow copy holders, its own 
wrench correction, and the original instance’s accumu¬ 
lated wrench. Subsequently, the updated accumulated 
wrench of each original instance is sent to the shadow 
copy holders in a second message-exchange communi¬ 
cation step. The wrench corrections are then reset ev¬ 
erywhere. 

The accumulated wrenches f ^are initialized 
on each process p before line [3] in Alg. |T] to 


be the set of process indices in direct neighborhood of 
process p’s subdomain, and let 


l d d = min inf { || y p - y q || 2 

pEV j 


Up g ) y? ^ - / ? 

qeV\(M p u{ P }) I 


be the shortest distance from a point inside a subdo¬ 
main to a non-nearest neighbor. Then the condition 


Ti + \\Vi(t)\\ 2 St + t < l dd VieB 


(7) 



V* G B p 


unless the initial solution is chosen to be non-zero. The 
wrench corrections are initially set to 0. If the external 
forces and torques are not known on each process or are 
scattered among the processes having instantiated the 
particles, the initialization requires another two mes¬ 
sage exchanges, as they are necessary at the end of each 
iteration. 

An alternative to storing accumulated wrenches and 
wrench corrections is to store accumulated velocities 
and velocity corrections. In that case, a process p in¬ 
stantiates variables Sv^, G K 3 l Bp L The 

accumulated velocities are set to i^(A^) and u>[ (AW) 
for all * G B p in each iteration. They are initialized 
and updated accordingly. The velocity corrections are 
initialized and updated analogously to the wrench cor¬ 
rections. Hereby, the velocity variables can be updated 
in place. In the classic NBGS no wrench or velocity 
correction variables would be necessary, but the correc¬ 
tions could be added to the velocity variables right away 
which is similar to the approach suggested by Tasora et 

al. in [37]. 


6.4 Nearest-Neighbor Communication 

In the following we describe how the strict locality of 
particle interactions can be used to optimize the paral¬ 
lel communication and synchronization by exchanging 
messages only between nearest neighbors. So far the 
shadow copies can be present on any process, and the 
corrections in the summation over wrench or velocity 
corrections can originate from a long list of processes. 
However, by requiring that the particle hulls do not 
extend past any neighboring subdomains, all message 
exchanges can be reduced to nearest-neighbor commu¬ 
nications. Let 

U p = {q G V \ {p} | inf {||» p - y q || 2 I y P G f2 p , y g € fl q } = 0} 


ensures in the first approximation that no hull extends 
past neighboring subdomains. This immediately defines 
a hard upper limit of fi < ldd — t for the bounding ra¬ 
dius and thus for the size of all objects. Furthermore, 
given the particle shapes, velocities, and safety mar¬ 
gins, the condition defines an upper limit for the time- 
step length. The introduction of condition ([TJ) entails 
that on a process p only the description of the subdo¬ 
mains within f2 p + {y G M 3 | ||y ||2 < ldd} needs to be 
available, meaning that the description of non-nearest- 
neighbor subdomains can be dispensed with, and that 
the description of nearest-neighbor subdomains do not 
have to be correct outside of the /^-surrounding of fi p . 
This leads to a localized description of the domain par¬ 
titioning on each process, describing the surrounding 
subdomains only. 

Typically, the size limit stemming from ([7]) is not a 
problem for the particles of the granular matter them¬ 
selves, but very well for boundaries or mechanical parts 
the granular matter interacts with. However, the num¬ 
ber of such enlarged bodies is typically significantly 
smaller than and independent of the number of small¬ 
sized particles, suggesting that they can be treated glob¬ 
ally. Let Bgiobai be the set of all body indices exceeding 
the size limit. These bodies will be referred to as being 
global in the following. All associated state variables 
and constants shall be instantiated on all processes and 
initialized equally. The time-integration of these global 
bodies then can be performed by all processes equally. 
If a global body i has infinite inertia ( rrii = oo and 
I 3 = ool), such as a stationary wall or a non-stationary 
vibrating plate, the body velocities are constant, and no 
wrenches need to be communicated. Global bodies hav¬ 
ing a finite inertia can be treated by executing an all- 
reduce communication primitive whenever reducing the 
wrench or velocity corrections of the small-sized bodies. 
Instead of only involving neighboring processes, the all- 
reduce operation sums up the corrections for each global 
body with finite inertia from all processes and broad¬ 
casts the result, not requiring any domain partitioning 
information. 
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6.5 Time-Integration and Synchronization Protocol 

Having solved the contact problem F( A) = 0 by Alg.jl] 
the time-integration defined in 0 needs to be per¬ 
formed. If the NBGS implementation uses velocity ac¬ 
cumulators, the integrated velocities are at hand after 
the final communication of the velocity corrections. If 
instead the NBGS implementation uses wrench accu¬ 
mulators, the wrenches are at hand, and the velocities 
of all local bodies can be updated immediately. 

Subsequently, the time-integration of the positions 
can take place. Updating a body’s position or orienta¬ 
tion effects that the list of shadow copy holders changes 
since the intersection hull possibly intersects with dif¬ 
ferent subdomains. Also, the body’s center of mass can 
move out of the parent’s subdomain. In order to restore 
the fulfillment of the requirements [I] and [2j a process 
must determine the new list of shadow copy holders 
and the new parent process for each local body after 
the position update. Shadow copy holders must be in¬ 
formed when such shadow copies become obsolete and 
must be removed. Analogously, processes must be no¬ 
tified when new shadow copies must be added to their 
state. In this case copies of the corresponding state vari¬ 
ables, constants, list of shadow copy holder indices, and 
index of the parent process must be transmitted. 

All other shadow copy holders must obtain the new 
state variables, list of shadow copy holder indices, and 
index of the parent process. Hereby, the condition from 
0 guarantees that all communication partners are neigh¬ 
bors. All information can be propagated in a single ag¬ 
gregated nearest-neighbor message-exchange. The in¬ 
formation should be communicated explicitly and should 
not be derived implicitly, in order to avoid inconsisten¬ 
cies. This is essential to guarantee a safe determination 
of the contact treatment responsibilities as well as time- 
integration responsibilities. 

Our implementation of the synchronization proto¬ 
col makes use of separate containers for storing shadow 
copies and original instances in order to be able to enu¬ 
merate these different types of bodies with good per¬ 
formance. Both containers support efficient insertion, 
deletion and lookup operations for handling the fluctu¬ 
ations and updates of the particles efficiently. Further¬ 
more, the determination of the new list of shadow copy 
holders involves intersection tests between intersection 
hulls of local bodies and neighboring subdomains as re¬ 
quirement [l] explains in Sect. |6.2] However, determining 
the minimal set of shadow copy holders is not necessary. 
Any type of bounding volumes can be used to ease in¬ 
tersection testing. In particular bounding spheres either 
with tightly fitting bounding radii fi + hi(t) or even 
with an overall bounding radius maXj e ,gFj + hi(t) as 


1: procedure simulateTimeStep 

2: C Pi ^ = broadPhaseCollisionDetection 

3: C Pt np = NARROWPHASEC OLLISION DETECTION (C p; bp ) 

4: C p = filterContacts(C P i „ p ) 

5: initializeAccumulatorAndCorrectionVariables 

6 : k <- 0 

7: Ahl «- 0 

8: while convergence criterion not met do 

9: for j «— 1 to u c A j £ C v do 

10: A' pl <- uFf 1 (0, Af pl ) + (1- u;)Ah ] 

11: end for 

12: reduceCorrections 

13: k <- k + 1 

14: end while 

15: integrateState Variables 

16: SYNCHRONIZE 

17: end procedure 

Algorithm 2: A single time step of the simulation on 
process p. 

proposed by Shojaaee et al. in [33 are canonical. Con¬ 
cerning the geometry of the subdomains at least the 
subdomain closures can be used for intersection test¬ 
ing. In our implementation we chose to determine al¬ 
most minimal sets of shadow copy holders by testing 
the intersections of the actual hull geometries of the 
particles with the closures of the subdomains. This re¬ 
duces the number of shadow copies and thus the overall 
communication volume in exchange for more expensive 
intersection tests. 


6.6 Summary 

Alg .[^summarizes the steps that need to be executed on 
a process p when time-integrating the system for a sin¬ 
gle time step St in parallel. The algorithm requires that 
all shadow copies are instantiated on all subdomains 
their hull intersects with. Furthermore, the shadow copies 
must be in sync with the original instance, and the 
global bodies must also be in sync to each other. The 
positions of all local bodies must be located within the 
local subdomain. The time step proceeds by executing 
the broad-phase contact detection which uses the po¬ 
sitions, orientations, shapes, hull expansion radii, and 
possibly information from previous time steps, in order 
to determine a set of contact candidates (body pairs) 
C p ,bp on process p in near-linear time. 

Then, in the narrow-phase contact detection, for 
all candidates the contact location, associated contact 
frame, and signed contact distance is determined if the 
hulls actually intersect. Finally, this set of detected con¬ 
tacts C p ,np needs to be filtered according to one of the 
rules presented above, resulting in C p , the set of contacts 
to be treated by process p. Before entering the itera- 
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tion of the subdomain NBGS, the accumulator, correc¬ 
tion, and contact reaction variables must be initialized. 
The initialization of the accumulator variables requires 
an additional reduction step if the external forces or 
torques cannot be readily evaluated on all processes. 

Each iteration of the subdomain NBGS on process p 
involves a sweep over all contacts to be treated by the 
process. The contacts are relaxed by a suitable one- 
contact solver. The j indexing indicates that such a 
solver typically needs to evaluate the relative contact 
velocity under the assumption that no reaction acts at 
the contact j. This can be achieved by subtracting out 
the corresponding part from the accumulator variables. 
The weighted relaxation result is then stored in place. 
The update of the wrench or velocity correction vari¬ 
ables is not explicitly listed. After the sweep the wrench 
or velocity corrections are sent to the respective parent 
process and summed up per body including the accu¬ 
mulator variables. Then the accumulator variables are 
redistributed to the respective shadow copy holders in 
a second message-exchange step. 

After a fixed number of iterations or some prescribed 
convergence criterion is met, the time step proceeds by 
executing the time-integration for each local body. The 
changes of the state variables must then be synchro¬ 
nized in a final message-exchange step, after which the 
preconditions of the next time step are met. Any user 
intervention taking place between two time steps needs 
to adhere to these requirements. 


7 Experimental Validation of Scalability 


This section aims to assess the scalability of the paral¬ 
lelization design presented in Sect. [6] as we implemented 
it in the pe which is an open-source software frame¬ 
work for massively parallel simulations of rigid bod¬ 
ies [13 EH- The implementation is based on velocity ac¬ 
cumulators and corrections, as introduced in Sect. |6.3[ 
The accumulator initialization performs an additional 
initial correction reduction step in all experiments. 


In Sect. |7.l| the idea behind weak- and strong-scaling 
experiments is explained before presenting the test prob¬ 


lems for which those experiments are executed in Sect. 7.2 
The scaling experiments are performed on three clus¬ 
ters whose properties are summarized and compared 
in Sect. |7.3| Sect. |7.4| points out the fundamental dif¬ 
ferences in the scalability requirements of the two test 
problems. Finally, in Sect. |7.5] the weak-scaling and in 
Sect. |7.6| the strong-scaling results are presented for 
each test problem and cluster. 


7.1 Weak and Strong Scalability 

To demonstrate the scalability of the algorithms and 
their implementation, we perform weak-scaling exper¬ 
iments, where the problem size is chosen directly pro¬ 
portional to the number of processes and such that the 
load per process stays constant. Thus, if ideal scaling 
were achieved, the time to solution would stay constant. 
Let t p be the time to solution on p processes, then the 
parallel efficiency e p , ws in a weak-scaling experiment is 
defined to be 

_ ti 

&p,WS — , 
ip 

In strong-scaling experiments, in contrast, the prob¬ 
lem size is kept constant, effecting a decreasing work 
load per process when increasing the number of pro¬ 
cesses. Thus, ideally the time to solution on p processes 
should be reduced by p in comparison to the time to 
solution on a single process. The speedup s p on p pro¬ 
cesses is defined to be 


ti 



The parallel efficiency e PjSS in a strong-scaling experi¬ 
ment is then the fraction of the ideal speedup actually 
achieved 

s p t\ 

&p,ss — — ; ■ 

P Ptp 

Sometimes speedup and parallel efficiency are also 
stated with respect to a different baseline, that is, a 
single central processing unit (CPU) or a single node 
rather than a single hardware thread or core - the 
principle remains the same. The parallel efficiency in 
a weak- and strong-scaling context is a simple perfor¬ 
mance metric that will serve in the following to assess 
the quality of the parallelization. 


7.2 Test Problems 

The scalability of the parallelization algorithm as it is 
implemented in the pe framework is validated based on 
two fundamentally different families of test problems. 
Sect. |7.2.T[ describes a family of dilute granular gas se¬ 
tups whereas Sect. |7.2.2| describes a family of hexagonal 
close packings of spheres corresponding to structured 
and dense setups. We chose these setups because their 
demands towards the implementation vary considerably 
which will be analyzed in detail in Sect. |7.4] 
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7.2.1 Granular Gas 

Granular material attains a gaseous state when suffi¬ 
cient energy is brought into the system, for example by 
vibration. Consequently, granular gases feature a low 
solid volume fraction and are dominated by binary col¬ 
lisions. When the energy supply ceases, the system cools 
down due to dissipation in the collisions. Granular gases 
are not only observed in laboratory experiments, but 
appear naturally for example in planetary rings |35j and 
in technical applications such as granular dampers [19j. 
These systems in general exhibit interesting effects like 
the inelastic collapse [M] or other clustering effects as 
they e.g. can be observed in the Maxwell-demon exper¬ 
iment m- 

As initial conditions, a rectangular domain with con¬ 
fining walls is chosen. The domain contains a prescribed 
number of non-spherical particles arranged in a Carte¬ 
sian grid. Random initial velocities are assigned to the 
particles with up to m /s ss 0.35 m /s. The particles 
are composed of two to four spheres of varying radius 
in the range [ 0 . 6 cm, 0 . 8 cm], arranged at the bound¬ 
ary of a bounding sphere with a diameter of 1 cm. The 
distance between the centers of two granular particles 
along each spatial dimension is 1 . 1 cm, amounting to 
a solid volume fraction of 23% on average. In pj5] al¬ 
most the same family of setups served as a scalability 
test problem. However, there the granular gases had a 
solid volume fraction of 3.8% on average. In order to 
test a higher collision frequency, a more dense granular 
gas was chosen here. The system is simulated for ^ s, 
and the time step is kept constant at 100 ps, resulting 
in 1 000 time steps in total. Since the contacts are dis¬ 
sipative and no energy is added, the system is quickly 
cooling down. The coefficient of friction is 0.1 for any 
contact. 

For this test problem, the subdomain NBGS solver 
requires a slight underrelaxation in order to prevent di¬ 
vergence. Using an underrelaxation parameter of 0.75 
produces good results. For binary collisions, a single it¬ 
eration of the solver would suffice, but because particles 
cluster due to the inelastic contacts, more iterations are 
required. This could be determined by a dynamic stop¬ 
ping criterion, but in the scenario presented here it was 
found to be more efficient to perform a fixed number of 
10 iterations. 

For particle simulations, the work load strongly de¬ 
pends on the number of particles and contacts. For the 
weak-scaling experiments, each process is responsible 
for a rectangular subdomain, initially containing a fixed 
number of particles arranged in a Cartesian grid. For 
the strong-scaling experiments, the total number of par¬ 
ticles in x-, y-, and z-dimension should be divisible by 


the number of processes in x-, y-, and z-dimension that 
is used in the experiment. With this arrangement the 
initial load is perfectly balanced. Statistically, the load, 
that is the number of particles and contacts per sub- 
domain, remains balanced if the subdomains are large 
enough, and clustering effects have not yet progressed 
too far. The duration of the simulation was chosen such 
that the load remains well balanced throughout. 

7.2.2 Hexagonal Close Packing of Spheres 

This setup aims to assess the scalability of the par¬ 
allelization for dense granular setups. To demonstrate 
the scalability, the initial setup should be easily and 
efficiently generateable for arbitrary problem sizes and 
should feature a good load balance over a longer pe¬ 
riod of time. Hence, a hexagonal close packing of equal 
spheres was chosen, for which simple formulas for the 
position of the spheres are available. The packing den¬ 
sity is known to be 7 ^ « 74.0%. According to the Ke¬ 
pler conjecture, a hexagonal close packing is the densest 
possible packing of spheres. To avoid load imbalances 
at the boundaries, a domain is chosen that is periodic 
in the x- and y-dimension. In z-dimension the packing 
is confined by walls that are in direct contact with the 
spheres on both sides. Assuming an even number of 
particles in y-direction, the number of contacts is per¬ 
manently n x n y (6n z — 1) for n x x n y x n z particles. The 
domain is decomposed in x- and y-dimensions only. The 
objects are subject to gravity. However, the gravity is 
tilted in the x-z-plane such that the setup corresponds 
to a ramp inclined by 30° including a lid. The mag¬ 
nitude of gravity is 9.81 m /s 2 . The time step is 10 ps 
constantly. The radii of the particles are 1 mm, and 
their density is 2.65s/cm 3 . All particles get an initial 
downhill velocity of 10 cm /s. The coefficient of friction 
is 0.85 constantly for any contact. The high coefficient 
of friction causes a slip-stick transition shortly after the 
simulation begins. As in the granular gas setups the 
subdomain NBGS uses an underrelaxation of 0.75. The 
solver unconditionally performs 100 iterations in each 
time step. This intentionally disregards that the iter¬ 
ative solver converges faster for smaller problems. A 
multigrid solver could possibly remedy the dependence 
on the problem size, but the successful construction of 
such a solver needs substantial further research. 

7.3 Test Machines 

In the following all test machines are presented. Tab. [I] 
summarizes the basic information. The Emmy cluster 
is located at the Regional Computing Centre in Er¬ 
langen (RRZE) in Germany which is associated to the 
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cluster name 

Emmy 

SuperMUC 

Juqueen 

computing centre 

Regional Computing Centre in 

Leibniz Supercomputing 

Jiilich Supercomputing 

Erlangen (RRZE), Germany 

Centre (LRZ), Germany 

Centre (JSC), Germany 

best TOP 500 ranking 

- 

4th (June 2012) 

5th (November 2012) 

peak performance in pf 1 o p/ s 

0.23 

3.2 

5.9 

number of nodes 

560 

9216 

28 672 

number of sockets 

2 

2 

1 

name of CPU 

Intel Xeon E5-2660 v2 

Intel Xeon E5-2680 

IBM PowerPC A2 

clock rate in GHz 

2.2 

2.7 

1.6 

number of cores per CPU 

10 

8 

16 

number of threads per core 

2 

2 

4 

total RAM in TiB 

35 

288 

448 

interconnection fabric 

Infiniband QDR 

Infiniband QDR/ 
Infiniband FDR 10 

BlueGene/Q 

network topology 

non-blocking tree 

non-blocking tree/ 

4:1 pruned tree 

5D torus 


Table 1: The test machines used for performing the weak- and strong-scaling experiments. 


Friedrich-Alexander-Universitat Erlangen-Niirnberg. The 
cluster comprises 560 compute nodes. Each node has a 
dual-socket board equipped with two Xeon E5-2660 v2 
processors. Each processor has 10 cores clocked at 2.2 GHz. 
The processors offer 2-way simultaneous multithread¬ 
ing (SMT). The peak performance of the cluster is 
0.23 pf1o p/s. Each node is equipped with 64GiB of ran¬ 
dom access memory (RAM). The cluster features a fully 
non-blocking Infiniband interconnect with quad data 
rate (QDR) and 4x link aggregation, resulting in a 
bandwidth of 40 Gbit /s p er link and direction. In all ex¬ 
periments on the Emmy cluster, each core is associated 
with a subdomain since preliminary tests showed that 
we could not take advantage of the SMT features by 
associating each hardware thread with a subdomain. 
The Emmy cluster has the smallest peak performance 
among the test machines and was never among the 500 
world’s largest commercially available supercomputers. 
However, it is the only machine with the largest non- 
blocking tree network topology and the largest amount 
of RAM per core. 

The second test machine is the SuperMUC super¬ 
computer which is located at the Leibniz Supercomput- 
ing Centre (LRZ) in Germany and was best ranked on 
the 4th place of the TOP 500 list in June 2012. The 
cluster is subdivided into multiple islands. The major¬ 
ity of the compute power is contributed by the 18 thin- 
node islands. Each thin-node island consists of 512 com¬ 
pute nodes (excluding four additional spare nodes) con¬ 
nected to a fully non-blocking 648 port FDR10 Infini¬ 
band switch with 4x link aggregation, resulting in a 
bandwidth of 40 Gbit /s per link and direction. Though 
QDR and FDR10 use the same signaling rate, the ef¬ 
fective data rate of FDR10 is more than 20% higher 
since it uses a more efficient encoding of the transmit¬ 
ted data. The islands’ switches are each connected via 


126 links to 126 spine switches. This results in a block¬ 
ing switch-topology. Thus, if e.g. all nodes within an 
island send to nodes located in another island, then 
the 512 nodes have to share 126 links to the spine 
switches, effecting that the bandwidth is roughly one 
quarter of the bandwidth that would be available in an 
overall non-blocking switch-topology. Each (thin) com¬ 
pute node has two sockets, each equipped with an In¬ 
tel Xeon E5-2680 processor having 8 cores clocked at 
2.7 GHz. The processors support 2-way SMT. In the 
following, as in the case of the Emmy cluster, each core 
is associated with a single subdomain. The peak per¬ 
formance of the cluster is stated to be 3.2 pf1o p/s. Each 
node offers 32 GiB of RAM, summing up to 288 TiB in 
total. The SuperMUC supercomputer has an interest¬ 
ing blocking tree network-topology and the processors 
with the highest clock rate among the processors in the 
test machines. 

The third test machine is the Juqueen supercom¬ 
puter which is located at the Jiilich Supercomputing 
Centre (JSC) in Germany and was best ranked on the 
5th place of the TOP 500 list in November 2012. The 
cluster is a BlueGene/Q system with 28 672 compute 
nodes since 2013 muni- Each node features a single 
IBM PowerPC A2 processor having 18 cores clocked at 
1.6 GHz, where only 16 cores are available for comput¬ 
ing. The processors support 4-way SMT. The Juqueen 
supercomputer is the only machine, where we decided 
to associate each hardware thread with a subdomain 
in the scaling experiments. The machine’s peak perfor¬ 
mance is 5.9 pf1o p/s. Each node offers 16 GiB of RAM, 
summing up to 448 TiB in total. The interconnect fabric 
is a 5D torus network featuring a bandwidth of 16 Gb it/ S 
per link and direction [8J. The Juqueen supercomputer 
is the machine with the highest peak performance, the 
largest number of cores and threads and the only ma- 
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chine among our test machines with a torus intercon¬ 
nect. 

Tab. [2] presents a summary of the domain parti¬ 
tionings used for the scaling experiments on the vari¬ 
ous clusters. The number of nodes are always a power 
of two except when using the whole machine or when 
performing intra-node scalings. The intra-node scaling 
behaviour is analyzed by means of weak-scaling exper¬ 
iments choosing the granular gas as a test problem and 
the Emmy cluster as a test machine. The influence of 
the number of dimensions in which the domain is par¬ 
titioned is also only analyzed for this configuration. 
All further scaling tests of the granular gas scenario 
use three-dimensional domain partitionings. All inter¬ 
node weak-scaling experiments start with a single node 
and extend to the full machine where possible. The ex¬ 
periments on the SuperMUC supercomputer were ob¬ 
tained at the Extreme Scaling Workshop in July 2013 
at the LRZ, where at most 16 islands corresponding 
to 8 192 nodes were available. All strong-scaling exper¬ 
iments start on a single node except on the Juqueen 
supercomputer, where we chose to start at 32 nodes 
which is the minimum allocation unit in the batch sys¬ 
tem on Juqueen. The experiments extend to a number 
of nodes where a notable efficiency degradation is ob¬ 
served. Since the results on the SuperMUC were ob¬ 
tained well before the other experiments, no scaling 
experiments with the hexagonal close packing scenario 
were performed. 


7.4 Time-Step Profiles 

In this section we clarify how much time is spent in 
the various phases of the time-step procedure and how 
this time changes in a weak scaling depending on the 
test problem. Fig. [l] breaks down the wall-clock times 
of various time step components in two-level pie charts 
for the granular gas scenario. The times are averaged 
over all time steps and processes. The dark blue section 
corresponds to the fraction of the time in a time step 
used for detecting and filtering contacts. The orange 
section corresponds to the time used for initializing the 
velocity accumulators and corrections. The time to re¬ 
lax the contacts is indicated by the yellow time slice. It 
includes the contact sweeps for all 10 iterations with¬ 
out the correction reductions. The time used by all cor¬ 
rection reductions is shown in the green section which 
includes the reductions for each iteration and the re¬ 
duction after the initialization. The time slice is split 
up on the second level in the time used for assembling, 
exchanging, and processing the first correction reduc¬ 
tion message (dark green section) and the time used 
for assembling, exchanging, and processing the second 


H Contact Detection 
I Initialization 
■ Contact Sweeps 



(a) Time-step profile of the 
granular gas executed with 
5 x 2 x 2 = 20 processes on 
a single node. 


I Correction Reductions 

■ Position Integration 

■ Synchronization 



(b) Time-step profile of the 
granular gas executed with 
8 x 8 x 5 = 320 processes on 
16 nodes. 


Fig. 1: The time-step profiles for two weak-scaling exe¬ 
cutions of the granular gas on the Emmy cluster with 
25 3 particles per process. 


correction reduction message (light green section). The 
time slices are depicted counterclockwise in the given 
order. The message-exchange communications have a 
dotted border to distinguish them from the rest. A 
single message-exchange communication time measure¬ 
ment started, when sending the first message buffer to 
the neighbors, and ended, when having received the last 
message buffer from the neighbors. The dark red section 
corresponds to the time used by the time-integration 
of the positions, and the final blue section indicates 
the time used by the position synchronization. The lat¬ 
ter is split up into assembling, exchanging, and pro¬ 
cessing of the message in the inner ring. The message- 
exchange communication is highlighted by the dashed 
border again. The first pie chart in Fig. |la| corresponds 
to the time-step profile of an execution in the weak- 
scaling experiment with the three-dimensional domain 
partitioning 5 x 2 x 2 on a single node of the Emmy 
cluster. Fig. [lb] shows the time-step profile of an exe¬ 
cution in the weak-scaling experiment with the three- 
dimensional domain partitioning 8 x 8 x 5 on 16 nodes. 
The two time slices involving communication need more 
time in comparison to Fig. |la[ especially the framed 
slices on the second level which amount to the com¬ 
munication. The wall-clock time for the components 
involving no communication was roughly the same in 
both runs. The enlarged synchronization time-slices in 
Fig. [Tb] then approximately amount to the increased 
time-step duration on 16 nodes. Overall, computations 
in the time step of this granular gas scenario prevail. 
But since the collision frequency is low, the 10 contact 
sweeps, marked by the yellow and green sections, are 
dominated by communication. 
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(a) Domain partitionings used on the Emmy cluster. 
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(b) Domain partitionings used on the Juqueen supercomputer. 
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(c) Domain partitionings used on the SuperMUC supercomputer. 

Table 2: Summary of the domain partitionings used on all test clusters. 


Fig.[2]presents time-step profiles for two weak-scaling 
executions of the hexagonal close packing scenario. The 
time-step profiles use the same color coding as in Fig. |T] 
In contrast to the time-step profiles of the granular gas 
scenario, the time step is dominated by the 100 contact 
sweeps (yellow section) and the 100 correction reduc¬ 
tions (green section). Contact detection, position inte¬ 
gration, and synchronization play a negligible role. In 
Fig.[2a]the time-step profile of a weak-scaling execution 
with again 20 processes on a single node of the Emmy 
cluster is presented, whereas in Fig. [2b] the time-step 
profile of a weak-scaling execution with again 320 pro¬ 
cesses on 16 nodes is shown. The wall-clock time spent 
in the contact sweep was roughly the same in both ex¬ 
ecutions, hence the increased communication costs are 


mainly responsible for the larger time slice of the cor¬ 
rection reduction. 


The time-step profiles showed that for the dilute 
granular gas scenario the time spent in the various time- 
step components is well balanced and the time spent in 
the communication routines moderately increases as the 
problem size is increased. For the hexagonal close pack¬ 
ings most of the time is spent in the contact sweeps and 
the reduction of the velocity corrections. Components 
such as the position integration and the final synchro¬ 
nization play a negligible role due to the higher number 
of iterations in comparison to the granular gas scenario. 
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Granular Gas 


Hexagonal Close Packing 

Emmy 

Juqueen 

SuperMUC 

Emmy 

Juqueen 

number of particles per process 

25 3 

10 3 

10 3 

10 3 

10 3 

number of time steps 

1000 

1000 

10 000 

1000 

100 

maximum number of particles 

1.6- 10 s 

1.8- 10 9 

1.3- 10 s 

1.0- 10 7 

1.8- 10 9 

initial number of contacts 

0 

0 

0 

6.0- 10 7 

1.1-10 10 

solid volume fraction 

23% 

23% 

3.8% 

74% 

74% 


Table 3: Summary of the test problem parameters used for the weak-scaling experiments. 


■ Contact Detection 

■ Initialization 

I Contact Sweeps 



(a) Time-step profile of the 
hexagonal close packing sce¬ 
nario executed with 5 x 2 x 
2 = 20 processes on a single 
node. 


H Correction Reductions 

■ Position Integration 

■ Synchronization 



(b) Time-step profile of the 
hexagonal close packing sce¬ 
nario executed with 8 x 8 x 
5 = 320 processes on 16 
nodes. 



Fig. 3: Measured bandwidth of the triad in the stream 
benchmark computed with a varying number of cores 
on a single node of the Emmy cluster. 


Fig. 2: The time-step profiles for two weak-scaling exe¬ 
cutions of the hexagonal close packing scenario on the 
Emmy cluster with 10 3 particles per process. 


7.5 Weak-Scaling Results 

In the following subsections the weak-scaling results 
for both test problems on the clusters are presented. 
Tab. [3] gives an overview of the employed parameters. 
The experiments differ in terms of the number of parti¬ 
cles generated per process depending on the amount of 
memory available. In order to control the overall wall- 
clock time the number of time steps performed varies 
between 100 and 10 000. All wall-clock times presented 
in the following subsections correspond to the average 
wall-clock time needed to perform a single time step 
per 1 000 particles facilitating the comparison of the 
charts. The wall-clock times exclude the time needed to 
setup the systems and generate the simulation output. 
The scaling experiments of the granular gas scenario on 
SuperMUC differs from the other granular gas experi¬ 
ments in that the gas is considerably more dilute and a 
longer period of time is simulated. 


1.5.1 Granular Gas 

First, we pay special attention to the intra-node weak- 
scaling before turning to the inter-node weak-scaling 
since the former is subject to the non-linear scaling 
behaviour of the memory bandwidth. As a test prob¬ 
lem we chose the granular gas scenario and as the test 
machine the Emmy cluster. A single node in the clus¬ 
ter is equipped with two processors each one having 
a single on-chip memory controller. The total memory 
bandwidth available to both sockets is exactly twice 
the bandwidth of a single socket. However, for a single 
socket a simple stream benchmark [23] reveals that the 
memory architecture is designed such that for x cores 
more than 1 of the socket’s total memory bandwidth is 
available. Fig.[3]plots the measured memory bandwidth 
of computations of the triad as defined in the stream 
benchmark. The computations were performed by a 
varying number of cores in parallel. The first series of 
measurements minimized the number of sockets in use 
meaning that the processor affinities of the processes 
were adjusted such that all processes in measurements 
with less or equal to 10 processes shared the same mem¬ 
ory controller. In the second series of measurements the 
processes were pinned to the sockets alternately such 
that 2x processes had twice the bandwidth at their dis- 
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Fig. 4: Intra-node weak-scaling graphs for a granular 
gas on the Emmy cluster. 


posal as x processes in the first series. Indeed, the lower- 
left part of the first series’ graph very well matches the 
second series’ graph with proper scaling. The measured 
bandwidth in the first series increases for an increasing 
number of processes until the available memory band¬ 
width of the first memory controller is saturated. Mea¬ 
surements with more than 10 processes start to make 
use of the second memory controller and the measured 
bandwidth continues to increase linearly. 

An analogous behavior can be observed in the intra¬ 
node weak-scaling graphs. Fig.[4]plots the average wall- 
clock time needed for a single time step and 1 000 par¬ 
ticles. In the first series of executions again the pin¬ 
ning strategy minimizing the number of sockets in use 
was employed. The average wall-clock time needed per 
time step increases considerably for executions with 
up to 10 processes. However, beyond that point the 
weak-scaling graph continues almost ideally. The sec¬ 
ond series of executions used as before the pinning strat¬ 
egy minimizing the maximum number of processes per 
socket meaning that executions with 2x processes in 
the second series have twice the bandwidth at their 
disposal as the executions with x processes in the first 
series. The graphs of the second series show that the 
wall-clock times needed per time step on 2x processes 
indeed closely match the wall-clock times on x processes 
in the first series. This indicates that our implementa¬ 
tion is limited by the available memory bandwidth. 

The figure also distinguishes between weak-scaling 
graphs with one-, two-, and three-dimensional domain 
partitionings since their communication volumes differ. 
Higher-dimensional non-periodic domain partitionings 
have typically a higher communication volume in com¬ 
parison to lower dimensional non-periodic domain par¬ 
titionings with the same number of processes, due to 
the larger area of the interfaces between the subdo¬ 
mains. The plotted timings for the one-dimensional do¬ 


main partitionings are indeed consistently slightly bet¬ 
ter than the timings for two-dimensional domain par¬ 
titionings, which are in turn slightly better than the 
timings for three-dimensional domain partitionings. 


Even though the intra-node weak-scaling results re¬ 
veal an underperforming parallel efficiency between 30.8% 
and 32.9% when computing on all cores of an Emmy 
node, the correlation with the measured memory band¬ 
width of a triad suggests that a good intra-node scal¬ 
ing can be expected as long as the available bandwidth 
scales. With corresponding pinning this is the case as 
off the first full socket on the Emmy cluster. 


Fig. 5a extends the weak-scaling experiment to al¬ 
most the full Emmy cluster for one-, two-, and three- 
dimensional domain partitionings. The scaling experi¬ 
ment for the one-dimensional domain partitionings per¬ 
forms best and achieves on 512 nodes a parallel effi¬ 
ciency of 98.3% with respect to the single-node per¬ 
formance. The time measurements for two-dimensional 
domain partitionings are consistently slower, but the 
parallel efficiency does not drop below 89.7%. The time 
measurements for three-dimensional domain partition¬ 
ings come in last, and the parallel efficiency goes down 
to 76.1% for 512 nodes. This behaviour can be explained 
by the differences in the communication volumes of 
one-, two-, and three-dimensional domain partitionings. 
The results attest that the problem can be efficiently 
scaled (almost) up to the full machine if the load per 
process is sufficiently large. 


Fig. [5b] shows the results of the inter-node weak- 
scaling experiments on the Juqueen supercomputer. The 
scaling experiments were only performed with the more 
demanding three-dimensional domain parititionings. In 
the first series of measurements the average wall-clock 
time per time step increases as expected up to 2 048 nodes 
But then the average time-step duration for setups with 
4 096 nodes and beyond is significantly shorter than the 
average time-step duration with fewer nodes. The time 
steps are even computed faster than on a single node, 
where no inter-node communication takes place at all. 
Assuming that intra-node communication is faster than 
inter-node communication, this is a puzzling result. In 
fact, it turned out that the intra-node communication 
was responsible for the behaviour: The default mecha¬ 
nism for intra-node communication is via shared mem¬ 
ory on the Juqueen. In the second series of measure¬ 
ments we disallowed the usage of shared memory for 
intra-node communication. This resulted in the mea¬ 
surements that are consistently faster than the mea¬ 
surements from the first series, and the parallel effi¬ 
ciency is more or less monotonically decreasing with an 
excellent parallel efficiency of at least 92.9%. 
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(a) Weak-scaling graph on the Emmy cluster. 



(b) Weak-scaling graph on the Juqueen supercomputer. 



(c) Weak-scaling graph on the SuperMUC supercomputer. 

Fig. 5: Inter-node weak-scaling graphs for a granular 
gas on all test machines. 


The reason why the measured times in the first 
series became shorter for 4 096 nodes and more is re¬ 
vealed when considering how the processes get mapped 
to the hardware. The default mapping on Juqueen is 
ABCDET, where the letters A to E stand for the five 
dimensions of the torus network, and T stands for the 
hardware thread within each node. The six-dimensional 
coordinates are then mapped to the MPI ranks in a 
row-major order, that is, the last dimension increases 


fastest. The T coordinate is limited by the number of 
processes per node, which was 64 for the above measure¬ 
ments. Upon creation of a three-dimensional communi¬ 
cator, the three dimensions of the domain partition¬ 
ing are mapped also in row-major order. This effects, if 
the number of processes in z-dimension is less than the 
number of processes per node, that a two-dimensional 
or even three-dimensional section of the domain parti¬ 
tioning is mapped to a single node. However, if the num¬ 
ber of processes in z-dimension is larger or equal to the 
number of processes per node, only a one-dimensional 
section of the domain partitioning is mapped to a single 
node. A one-dimensional section of the domain parti¬ 
tioning performs considerably less intra-node communi¬ 
cation than a two- or three-dimensional section of the 
domain partitioning. This matches exactly the situa¬ 
tion for 2 048 and 4 096 nodes. For 2 048 nodes, a two- 
dimensional section 1 x 2 x 32 of the domain partitioning 
64 x 64 x 32 is mapped to each node, and for 4 096 nodes 
a one-dimensional section 1 x 1 x 64 of the domain par¬ 
titioning 64 x 64 x 64 is mapped to each node. To sub¬ 
stantiate this claim, we confirmed that the performance 
jump occurs when the last dimension of the domain par¬ 
titioning reaches the number of processes per node, also 
when using 16 and 32 processes per node. 


Fig. [5c] presents the weak-scaling results on the Su¬ 
perMUC supercomputer. The setup differs from the 
granular gas scenario presented in Sect. 7.2.1 in that it 
is more dilute. The distance between the centers of two 
granular particles along each spatial dimension is 2 cm, 
amounting to a solid volume fraction of 3.8% and conse¬ 
quently to less collisions. As on the Juqueen supercom¬ 
puter only three-dimensional domain partitionings were 
used. All runs on up to 512 nodes were running within a 
single island. The run on 1 024 nodes also used the min¬ 
imum number of 2 islands. The run on 4 096 nodes used 
nodes from 9 islands, and the run on 8 192 nodes used 
nodes from 17 islands, that is both runs used one island 
more than required. The graph shows that most of the 
performance is lost in runs on up to 512 nodes. In these 
runs only the non-blocking intra-island communication 
is utilised. Thus this part of the setup is very similar 
to the Emmy cluster since it also has dual-socket nodes 
with Intel Xeon E5 processors and a non-blocking tree 
Infiniband network. Nevertheless, the intra-island scal¬ 
ing results are distinctly worse. The reasons for these 
differences were not yet further investigated. However, 
the scaling behaviour beyond a single island is decent 
featuring a parallel efficiency of 73.8% with respect to 
a single island. A possible explanation of the under- 
performing intra-node scaling behaviour could be that 
some of the Infiniband links were degraded to QDR, 
which was a known problem at the time the extreme- 
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(a) Weak-scaling graph on the Emmy cluster. 
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(b) Weak-scaling graph on the Juqueen supercomputer. 

Fig. 6: Inter-node weak-scaling graphs for hexagonal 
close packings of spheres. 


scaling workshop took place. The communication rou¬ 
tines then need « 1.21 times longer to complete. 
This could also explain the high variability of the runs’ 
wall-clock times. 

Subsequently, a second series of measurements was 
performed with 60 3 non-spherical particles per process. 
The scaling behaviour is comparable to the scaling be¬ 
haviour observed in Fig. [5c] However, the largest weak- 
scaling run simulated 28 311552 000 « 2.8 • 10 10 non- 
spherical particles - a possibly record-breaking number 
for non-smooth contact dynamics. 


7.5.2 Hexagonal Close Packings of Spheres 


Fig. |6a| shows the average wall-clock time needed for a 
single time step in the hexagonal close packing test on 
the Emmy cluster. The parallel efficiency with respect 
to a single node remains above 79.9% for all execu¬ 
tions. This is slightly better than the parallel efficiency 
of 76.1% for the granular gas. 

The weak-scaling results of the hexagonal close pack¬ 
ing scenario on the Juqueen supercomputer are pre¬ 


sented in Fig. 6b The parallel efficiency with respect 


to a single node stayed above 91.4% for all measure¬ 
ments. This result is almost as good as the 92.9% par¬ 
allel efficiency in the scaling experiments of the granu¬ 
lar gas. The largest execution ran 1 024 x 1 792 x 1 = 
1835 008 processes on all 28 672 nodes of the machine, 
where 10 240 x 17 920 x 10 = 1 835 008 000 particles were 
spawned, in total leading to 10 826 547 200 « 1.1 • 10 10 
contacts - again a possibly record-breaking number for 
non-smooth contact dynamics. 


7.6 Strong-Scaling Results 

In the following subsections the strong-scaling results 
for both test problems on the clusters are presented. 
Tab. [4] gives an overview of the employed parameters. 
The experiments differ in terms of the number of par¬ 
ticles generated in total and the number of time steps 
used for averaging. As in the weak-scaling experiments 
the granular gas scenario on SuperMUC is considerably 
more dilute than on the other machines. 


7.6.1 Granular Gas 

Fig-0 presents the strong-scaling results of the granu¬ 
lar gas scenario on all clusters. The strong-scaling graph 
on the Emmy cluster is presented in Fig. [7a] A total of 
320 x 160 x 160 = 8192 000 particles was used, leading 
to at most 64 x 80 x 80 = 409 600 particles per process 
on a single node and at least 10 x 8 x 10 = 800 particles 
per process on 512 nodes. The speedup is ideal for up to 
64 nodes and then gradually becomes more inefficient. 
However, no turnover is observed. Some time measure¬ 
ments exceed the optimal speedup, which can happen 
for example if the problem becomes small enough to fit 
into one of the caches. In conclusion, the scaling experi¬ 
ments for this dilute setup on the Emmy cluster suggest 
that one obtains a satisfactory parallel efficiency on the 
whole cluster, as long as several thousand particles are 
present per process. 

Fig. [7b] presents the results of the strong-scaling ex¬ 
periments on the Juqueen supercomputer for the gran¬ 
ular gas. The total number of particles was 32 768 000 
particles. In the execution on 32 nodes each of the 
16x 16x8 = 2 048 processes initially had 20 x 20 x 40 = 
16 000 non-spherical particles, and in the execution on 
4 096 nodes each of the 64 x 64 x 64 = 262 144 pro¬ 
cesses spawned 5x5x5 = 125 particles. The parallel 
efficiency is plotted with respect to 32 nodes and stays 
above 80.7% for up to 1 024 nodes and 500 particles 
per process before rapidly decreasing. On 4 096 nodes 
the efficiency is at 55.4%. The weak- and strong-scaling 
results are both better in comparison to the Emmy clus- 
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Granular Gas Hexagonal Close Packing 



Emmy 

Juqueen 

SuperMUC 

Emmy 

Juqueen 

number of particles 
number of time steps 
solid volume fraction 

320 x 160 x 160 
1000 

23% 

320 x 320 x 320 
1000 

23% 

128 x 128 x 128 
100 

3.8% 

1 280 x 640 x 10 
50 

74% 

2 048 x 2 048 x 10 
20 

74% 


Table 4: Summary of the test problem parameters used for the strong-scaling experiments. 



8 16 32 64 128 256 512 

number of nodes 


(a) Strong-scaling graph on the Emmy cluster. 



(b) Strong-scaling graph on the Juqueen supercomputer. 



(c) Strong-scaling graph on the SuperMUC supercomputer. 

Fig. 7: Strong-scaling graphs for a granular gas test 
problem on all test machines. 


ter, owed to the torus network which performs excellent 
for the nearest-neighbor communication. 

The results of the strong-scaling experiments on the 
SuperMUC supercomputer are shown in Fig. [7c] In to¬ 
tal 128 3 non-spherical particles were simulated. Hence, 
in the single-node run each process owned 32 x 64 x 64 = 
131 072 particles, and in the run on 1 024 nodes, each 
process owned 8x4x4 = 128 particles. The parallel 
efficiency is at 90.0% on 256 nodes. Beyond that point 
it decreases dramatically, indicating that the scaling is 
fine as long as at least about 500 particles are present 
per process. 


7.6.2 Hexagonal Close Packings of Spheres 


In the strong-scaling experiment on the Emmy cluster 
in total 1 280 x 640 x 10 = 8 192 000 particles were gen¬ 
erated. The experiment was run for 1 to 512 nodes, such 
that the smallest setup with 5 x 4 x 1 = 20 processes on 
a single node generated 256 x 160 x 10 = 409 600 spher¬ 
ical particles per process, and the largest setup with 
128 x 80 x 1 = 10 240 processes on 512 nodes generated 
10 x 8 x 10 = 800 particles per process. Fig. |8a|presents 
the results. A super-linear speedup is observed for sev¬ 
eral executions, which is likely due to caching effects, 
since the working set size becomes very small. In the 
strong-scaling experiment for 512 nodes of the granu¬ 
lar gas scenario on Emmy also only 800 particles were 
generated per process. However, the computational in¬ 
tensity here is much higher in comparison to that of the 
granular gas, because far more contacts have to be re¬ 
solved. This explains the high parallel efficiency of 113% 
in comparison to the disappointing parallel efficiency of 


37.7% from Fig. 7a In conclusion, the scaling experi¬ 
ments suggest that a few hundred particles per process 
are enough to achieve a very good parallel efficiency on 
the Emmy cluster if the granular flow is dense. 

For the strong-scaling experiment on the Juqueen 
supercomputer a hexagonal close packing with 41 943 040 par¬ 
ticles in total was created. The smallest execution ran 
64 x 32 x 1 = 2 048 processes on 32 nodes, where 32 x 
64 x 10 = 20480 spherical particles were generated per 
process. The largest execution ran 512 x 512 x 1 = 

262 144 processes on 4096 nodes, where 4 x 4 x 10 = 


160 particles were generated per process. Fig. 8b shows 
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(a) Strong-scaling graph on the Emmy cluster. 



number of nodes 


(b) Strong-scaling graph on the Juqueen supercomputer. 

Fig. 8: Strong-scaling graphs for hexagonal close pack¬ 
ings of spheres. 

the speedup and the parallel efficiency on the second 
axis, both with respect to 32 nodes. A parallel effi¬ 
ciency of 75.0% on 4 096 nodes is achieved, where only 
160 particles were owned per process. This suggests that 
a reasonable good efficiency can be achieved for a dense 
setup on the Juqueen supercomputer, as long as several 
hundred particles are created per process. 

8 Related Work 

Other authors have proposed approaches for paralleliz¬ 
ing non-smooth contact dynamics on architectures with 
distributed memory. All of them are based on domain 
partitionings. A parallelization strategy termed non¬ 
smooth contact domain decomposition (NSCDD) im¬ 
plemented in the renowned LMGC90 code was lately 
presented in |38j [39 j by Visseq et al. The approach 
is inspired by the finite element tearing and intercon¬ 
nect (FETI) method for solving partial differential equa¬ 
tions in computational mechanics. The authors sug¬ 
gest to decouple the multi-contact problem such that 
on each process a multi-contact problem is solved hav¬ 


ing the same structure as a multi-contact problem that 
is solved sequentially. Particles with multiple contacts 
that are associated with different subdomains are du¬ 
plicated, similar to shadow copies used in this article. 
However, the mass and inertia are split among all in¬ 
stantiations. The coupling is recovered by adding linear 
equations gluing the duplicates back together through 
additional Lagrange multipliers. In contrast to the con¬ 
tact constraints, the interface equations are linear, and 
a block-diagonal system of linear equations must be 
solved after several sweeps over all contacts. In [39j, the 
authors present simulations with up to 2 • 10 s spher¬ 
ical particles and 2 • 10 6 contacts, time-integrated on 
up to 100 processes. The NSCDD allows non-nearest- 
neighbor communication in order to allow enlarged rigid 
bodies instead of introducing a concept analogous to 
global bodies. 

Prior to Visseq et al., Koziara et al. presented the 
parallelization implemented in the solfec code [20]. This 
approach dispenses with the separation into interface 
problems and local multi-contact problems. A classic 
NBGS is parallelized with a non-negligible but inevitable 
amount of serialization. Bodies are instantiated redun¬ 
dantly on all processes, prohibiting scaling beyond the 
memory limit. Instead of using accumulator and correc¬ 
tion variables, as proposed in this paper, the authors 
synchronize dummy particles (particles that are in con¬ 
tact with shadow copies or original instances) in ad¬ 
dition to shadow copies in order to implement contact 
shadow copies. As in the NSCDD, the system matrix 
(Delassus operator) is set up explicitly instead of using 
matrix-free computations as proposed here. Simulations 
are presented with up to 1 • 10 4 polyhedral particles or 
6 ■ 10 5 contacts time-integrated on up to 64 processes. 

At the same time, Shojaaee et al. presented another 
domain partitioning method in |34| . The presentation 
is restricted to two-dimensional problems. The solver 
in the paper corresponds to a subdomain NBGS with 
relaxation parameter ui = 1, where the authors argue 
that divergence does typically not occur. At least for 
three-dimensional simulations this is in our experience 
not sufficient. Shadow copies are created not only if the 
hulls overlap the neighboring subdomain but also if the 
particles approach the subdomain boundaries, simpli¬ 
fying the intersection testing but introducing excessive 
shadow copies. Shojaaee et al. also introduce contact 
shadow copies instead of using accumulator and cor¬ 
rection variables as proposed here. Simulations are pre¬ 
sented with up to 1 • 10 6 circular particles in a dense 
packing on up to 256 processes. 

The approach presented in this paper improves in 
general the robustness and scalability of previously pub¬ 
lished parallel algorithms. The matrix-free approach fa- 
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cilitates the evaluation of the particle wrenches in par¬ 
allel as suggested in Sect. |6.3| and thus reduces the 
amount of communicated data. The separation of bod¬ 
ies into global and local bodies allows to restrict message- 
exchange communications to nearest neighbors as de¬ 
tailed in Sect. |6.4| and thus maps well to various in¬ 
terconnect networks. Furthermore, the synchronization 


protocol defined in Sect. |6.2| and Sect. |6.5| is not suscep¬ 
tible to numerical errors in contrast to the conventional 
rules which are based on contact locations. Last but not 
least the scaling experiments from Sect. [7] with up to 
2.8 • 10 10 non-spherical particles or 1.1 • 10 10 contacts on 
up to 1.8 ■ 10 6 processes exceed all previously published 
numbers by a factor of 10 3 to 10 4 . 


of accumulators and correction variables enables the 
evaluation of the particle wrenches in parallel, reuses 
partial results and reduces the number of particles that 
need to be synchronized. 

The integration of the positions and orientations is 
entailed by the execution of an exceptionally robust 
synchronization protocol. The key to obtain this ro¬ 
bustness is to add the rank of the parent process and 
the ranks of the shadow copy holders to the state of 
each particle and to explicitly communicate the state 
changes. Only then processes can reliably agree upon 
responsibilities such as contact treatment and particle 
integration without being susceptible to numerical er¬ 
rors. 


9 Summary 

This article presents models and algorithms for per¬ 
forming scalable direct numerical simulations of gran¬ 
ular matter in hard contact as we implemented them 
in the pe open-source software framework for massively 
parallel simulations of rigid bodies. The pe framework 
already has been successfully used to simulate granu¬ 
lar systems with and without surrounding fluid in the 
past 00 . 

The discretization of the equations of motion un¬ 
derlying the time-stepping scheme use an integrator 
of order one. Contacts are modelled as inelastic and 
hard contacts with Coulomb friction. The hard con¬ 
tact model avoids the necessity to resolve the collision 
micro-dynamics and the time-stepping scheme avoids 
the necessity to resolve impulsive events in time. The 
one-step integration can be split into the integration 
of the velocities and the subsequent integration of the 
positions and orientations. 

The velocity integration requires the solution of a 
non-linear system of equations per time step. In or¬ 
der to reduce the size of the system in the first place 
conventional broad-phase contact detection algorithms 
are applied to exclude contacts between intersection 
hulls. To solve the non-linear system of equations the 
subdomain non-linear block Gauss-Seidel is used. The 
numerical solution algorithm is a mixture between a 
non-linear block Gauss-Seidel (NBGS) and a non-linear 
block Jacobi with underrelaxation. In contrast to a pure 
non-linear block Jacobi it only requires a mild under¬ 
relaxation and in contrast to a non-linear block Gauss- 
Seidel it accommodates the subdomain structure of the 
domain partitioning and thus allows an efficient par¬ 
allelization avoiding irregular data dependencies across 
subdomains. The implementation of the subdomain NBGS 
in the pe is matrix-free and thus avoids the expensive as¬ 
sembly of the Delassus operator. Furthermore, the use 


Beyond that, all messages are aggressively aggre¬ 
gated in order to reduce the communication overhead of 
small messages and all messages are restricted to near¬ 
est neighbors. The latter is achieved by splitting bodies 
into local and global bodies and identifying appropriate 
requirements. Both measures improve the scalability of 
the implementation. 

Finally, the scalability was demonstrated for dilute 
and dense setups on three clusters, two of them having 
been in the top 10 of the world’s largest publicly avail¬ 
able supercomputers. The parallel efficiency on Juqueen 
is outstanding. T he inter-island scaling results on Su- 
perMUC are satisfactory, however, the intra-island scal¬ 
ing results show room for improvement. That this is not 
inherently caused by the parallelization approach can 
be seen by inspecting the results of the Emmy cluster, 
whose architecture is close to a single island of Super- 
MUC. 

The largest scaling experiments demonstrate that 
simulations of unprecedented scale with up to 2.8 • 10 10 
non-spherical particles and up to 1.1 • 10 10 contacts are 
possible using up to 1.8-10 6 processes. The systmatic 
evaluation also confirms that good parallel efficiency 
can be expected on millions of processes even if only 
a few hundred particles are allocated to each process 
provided that the computation exhibits a sufficiently 
high computational intensity and the architecture has 
a good interconnect network . 

The favourable scalability results do not account for 
the fact that the NBGS solver does not scale (algorith¬ 
mically) in terms of the number of iterations needed to 
achieve a given error bound. Possible future develop¬ 
ments arise out of that: The convergence rate of multi¬ 
grid methods is independent of the number of unknowns 
and is in that sense optimal. A successful construction 
of such a multigrid method for hard contact problems 
would be invaluable for simulating every-increasing sys¬ 
tem sizes. 




Ultrascale Simulations of Non-smooth Granular Dynamics 


23 


References 

1. Anitescu M, Potra F (1997) Formulating dynamic 
multi-rigid-body contact problems with friction as 
solvable linear complementarity problems. Nonlin¬ 
ear Dynamics 14(3):231-247 

2. Anitescu M, Potra F (2002) A time-stepping 
method for stiff multibody dynamics with contact 
and friction. International Journal for Numerical 
Methods in Engineering 55(7):753-784 

3. Anitescu M, Tasora A (2010) An iterative approach 
for cone complementarity problems for nonsmooth 
dynamics. Computational Optimization and Appli¬ 
cations 47(2):207 235 

4. van den Bergen G (1999) A fast and robust GJK 
implementation for collision detection of convex ob¬ 
jects. Journal of Graphics Tools 4(2):7—25 

5. van den Bergen G (2001) Proximity queries and 
penetration depth computation on 3D game ob¬ 
jects. In: Game Developers Conference, vol 170 

6 . Bogner S, Mohanty S, Rude U (2015) Drag correla¬ 
tion for dilute and moderately dense fluid-particle 
systems using the lattice boltzmann method. Inter¬ 
national Journal of Multiphase Flow 68(0):71 79 

7. Bonnefon O, Daviet G (2011) Quartic formulation 
of Coulomb 3D frictional contact. Technical Report 
RT-0400, INRIA 

8 . Chen D, Eisley N, Heidelberger P, Senger R, Sug- 
awara Y, Kumar S, Salapura V, Satterfield D, 
Steinmacher-Burow B, Parker .1 (2012) The IBM 
Blue Gene/Q interconnection fabric. IEEE Micro 
32(1) :32—43 

9. Cohen J, Lin M, Manoclra D, Ponamgi M (1995) I- 
COLLIDE: An interactive and exact collision detec¬ 
tion system for large-scale environments. In: Pro¬ 
ceedings of the 1995 symposium on Interactive 3D 
graphics, ACM, pp 189-ff 

10. Diebel J (2006) Representing attitude: Euler an¬ 
gles, unit quaternions, and rotation vectors 

11. Esefelcl B (2014) Numerische Integration von 
Melirkorpersystemen mit mengenwertigen Kraftge- 
setzen. Herbert Utz Verlag 

12. Fischermeier E, Bartuschat D, Preclik T, Marechal 
M, Mecke K (2014) Simulation of a hard- 
spherocylinder liquid crystal with the pe. Computer 
Physics Communications 185(12):3156-3161 

13. Gilbert E, Johnson D, Keerthi S (1988) A fast pro¬ 
cedure for computing the distance between complex 
objects in three-dimensional space. IEEE Journal of 
Robotics and Automation 4(2): 193-203 

14. Gilge M, et al (2013) IBM System Blue Gene Solu¬ 
tion Blue Gene/Q Application Development. IBM 
Redbooks 


15. Iglberger K, Rude U (2009) Massively parallel rigid 
body dynamics simulations. Computer Science- 
Research and Development 23(3-4):159-167 

16. Iglberger K, Rude U (2010) Massively parallel 
granular flow simulations with non-spherical parti¬ 
cles. Computer Science-Research and Development 
25(1-2):105-113 

17. Iglberger K, Riide U (2011) Large-scale rigid 
body simulations. Multibody System Dynamics 
25(1):81 95 

18. Jean M (1999) The non-smooth contact dynamics 
method. Computer Methods in Applied Mechanics 
and Engineering 177(3-4) :235-257 

19. Kollmer J, Sack A, Heckel M, Poschel T (2013) 
Relaxation of a spring with an attached granular 
damper. New Journal of Physics 15(9):093,023 

20. Koziara T, Bicanic N (2011) A distributed memory 
parallel multibody contact dynamics code. Interna¬ 
tional Journal for Numerical Methods in Engineer¬ 
ing 87(1-5) :437-456 

21. Leyffer S (2006) Complementarity constraints as 
nonlinear equations: Theory and numerical experi¬ 
ence. In: Optimization with Multivalued Mappings, 
Springer, pp 169-208 

22. Liu C, Jain S (2012) A quick tutorial on multibody 
dynamics. Tech, rep., Georgia Institute of Technol¬ 
ogy 

23. McCalpin J (1995) Memory bandwidth and ma¬ 
chine balance in current high performance comput¬ 
ers. IEEE Computer Society Technical Committee 
on Computer Architecture (TCCA) Newsletter pp 
19-25 

24. McNamara S, Young W (1994) Inelastic collapse in 
two dimensions. Physical Review E 50(1):R28-R31 

25. Miller S, Luding S (2004) Event-driven molecu¬ 
lar dynamics in parallel. Journal of Computational 
Physics 193(1):306-316 

26. Mitarai N, Nakanishi H (2012) Granular flow: Dry 
and wet. The European Physical Journal Special 
Topics 204(1):5-17 

27. Moreau J, Panagiotopoulos P (1988) Nonsmooth 
Mechanics and Applications, vol 302. Springer 

28. Negrut D, Tasora A, Mazhar H, Heyn T, Hahn P 
(2012) Leveraging parallel computing in multibody 
dynamics. Multibody System Dynamics 27(1) :95— 
117 

29. Popa C, Preclik T, Rude U (2014) Regularized so¬ 
lution of LCP problems with application to rigid 
body dynamics. Numerical Algorithms pp 1 12 

30. Sauer J, Schomer E (1998) A constraint-based ap¬ 
proach to rigid body dynamics for virtual reality 
applications. In: Proceedings of the ACM Sympo¬ 
sium on Virtual Reality Software and Technology, 



24 


Tobias Preclik, Ulrich Rude 


pp 153-162 

31. Schindler T, Acary V (2014) Timestepping schemes 
for nonsmooth dynamics based on discontinuous 
Galerkin methods: Definition and outlook. Mathe¬ 
matics and Computers in Simulation 95(0):180-199 

32. Schutte K, van der Waerden B (1952) Das Prob¬ 
lem der dreizehn Kugeln. Mathematische Annalen 
125(l):325-334 

33. Shen Y, Stronge W (2011) Painleve paradox during 
oblique impact with friction. European Journal of 
Mechanics - A/Solids 30(4):457-467 

34. Shojaaee Z, Shaebani M, Brendel L, Torok J, Wolf 
D (2012) An adaptive hierarchical domain decom¬ 
position method for parallel contact dynamics sim¬ 
ulations of granular materials. Journal of Compu¬ 
tational Physics 231(2):612-628 

35. Spahn F, Petzschmann O, Schmidt J, Sremcevic M, 
Hertzsch JM (2001) Granular viscosity, planetary 
rings and inelastic particle collisions. In: Granular 
Gases, Springer, pp 363-385 

36. Studer C (2009) Numerics of Unilateral Contacts 
and Friction: Modeling and Numerical Time Inte¬ 
gration in Non-smooth Dynamics, Lecture Notes 
in Applied and Computational Mechanics, vol 47. 
Springer 

37. Tasora A, Anitescu M (2011) A matrix-free 
cone complementarity approach for solving large- 
scale, nonsmooth, rigid body dynamics. Computer 
Methods in Applied Mechanics and Engineering 
200(5) :439-453 

38. Visseq V, Martin A, Dureisseix D, Dubois F, 
Alart P (2012) Distributed nonsmooth contact do¬ 
main decomposition (NSCDD): Algorithmic struc¬ 
ture and scalability. In: Proceedings of the Interna¬ 
tional Conference on Domain Decomposition Meth¬ 
ods 

39. Visseq V, Alart P, Dureisseix D (2013) High per¬ 
formance computing of discrete nonsmooth con¬ 
tact dynamics with domain decomposition. Inter¬ 
national Journal for Numerical Methods in Engi¬ 
neering 96(9):584-598 

40. Wautelet P, Boiarciuc M, Dupays J, Giuliani S, 
Guarrasi M, Muscianisi G, Cytowski M (2014) Best 
Practice Guide Blue Gene/Q. vl.1.1 edn 

41. van der Weele K, van der Meer D, Versluis M, Lohse 
D (2001) Hysteretic clustering in granular gas. Eu¬ 
rophysics Letters 53(3):328 



